Abstract

We study the problem of sequential prediction of categorical data and
discuss a generalisation of Blackwell's algorithm on 0-1 data. The arguments
are based on Blackwell's approachability results given in [1]. They use mainly
linear algebra.



1 Introduction and Background 

Let us consider the problem of sequential prediction of categorical data. Let
$D = \{0, 1, \dots, d-1\}$ denote the set of possible outcomes with $d \ge 2$. Let $x_1, x_2, \dots$
be an infinite sequence with values in $D$. Let $Y_1, Y_2, \dots$ denote the sequence of
predictions. This is a random sequence with values in $D$. $Y_{n+1}$ predicts $x_{n+1}$ and
may depend on the first $n$ outcomes $x_1, x_2, \dots, x_n$, on $Y_1, Y_2, \dots, Y_n$ and on some
additional random mechanism. Our goal is to construct a sequential prediction
procedure which works well for all sequences $(x_n)_{n \ge 1}$ in an asymptotic sense. We intend
to generalize Blackwell's prediction procedure for two categories. The algorithm
of Blackwell can be described as follows using Figure 1 below. Let $x_1, x_2, \dots$ be
an infinite 0-1 sequence. Let $\bar{x}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ be the relative frequency of the
"ones" and $\bar{\gamma}_n = \frac{1}{n}\sum_{k=1}^{n} 1_{\{Y_k = x_k\}}$ the relative frequency of correct guesses. Let
$\mu_n = (\bar{x}_n, \bar{\gamma}_n) \in [0,1]^2$ and $S = \{(x, y) \in [0,1]^2 \mid y \ge \max(x, 1-x)\}$.
In Fig. 1 let $D_1$, $D_2$ and $D_3$ be the left, right, and bottom triangles, respectively,
in the unit square so that $D_1 = \{(x, y) \in [0,1]^2 \mid x \le y \le 1-x\}$ etc. When $\mu_n \in D_3$,
draw the line through the points $\mu_n$ and $(\frac{1}{2}, \frac{1}{2})$ and let $(w_n, 0)$ be the point where
this line crosses the horizontal axis. The Blackwell algorithm chooses its prediction
$Y_{n+1}$ on the basis of $\mu_n$ according to the (conditional) probabilities

$$P(Y_{n+1} = 1) = \begin{cases} 0 & \text{if } \mu_n \in D_1, \\ 1 & \text{if } \mu_n \in D_2, \\ w_n & \text{if } \mu_n \in D_3. \end{cases}$$

When $\mu_n$ is in the interior of $S$, $Y_{n+1}$ can be chosen arbitrarily. Let $Y_1 = 0$. It then
holds that for the Blackwell algorithm applied to any 0-1 sequence $x_1, x_2, \dots$ the
sequence $(\mu_n;\ n \ge 1)$ converges almost surely to $S$, i.e. $\mathrm{dist}(\mu_n, S) \to 0$ as $n \to \infty$
almost surely. Here $\mathrm{dist}(\mu_n, S)$ denotes the Euclidean distance from $\mu_n$ to $S$.
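
The two-category rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the value returned in the interior of $S$ (where the text allows an arbitrary choice) is set to 0.5, and $w_n$ is computed from the line through $\mu_n$ and $(\frac{1}{2}, \frac{1}{2})$, which gives $w_n = \frac{1}{2} + (\bar{x}_n - \frac{1}{2})/(1 - 2\bar{\gamma}_n)$.

```python
import random


def blackwell_prob_one(x_bar, gamma_bar):
    """Probability of predicting 1, given mu_n = (x_bar, gamma_bar)."""
    x, y = x_bar, gamma_bar
    if y > max(x, 1.0 - x):
        # Interior of S: any choice is allowed; 0.5 is an arbitrary pick.
        return 0.5
    if y <= min(x, 1.0 - x):
        # Bottom triangle D_3: project mu_n from (1/2, 1/2) onto the x-axis.
        if y >= 0.5:  # only possible at the centre (1/2, 1/2)
            return 0.5
        return 0.5 + (x - 0.5) / (1.0 - 2.0 * y)
    # Left triangle D_1 predicts 0, right triangle D_2 predicts 1.
    return 0.0 if x < 0.5 else 1.0


def simulate(xs, seed=0):
    """Run the rule on a 0-1 sequence; return (x_bar_n, gamma_bar_n)."""
    rng = random.Random(seed)
    ones = correct = 0
    y = 0  # Y_1 = 0, as in the text
    for n, x in enumerate(xs, start=1):
        ones += x
        correct += int(y == x)
        p_one = blackwell_prob_one(ones / n, correct / n)
        y = 1 if rng.random() < p_one else 0
    return ones / len(xs), correct / len(xs)
```

For the constant sequence of ones, the first guess $Y_1 = 0$ is wrong and every later guess is 1, so `simulate([1] * 50)` ends at $\mu_{50} = (1, 0.98)$, close to the corner $(1, 1)$ of $S$.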

As Blackwell once pointed out, this is a direct consequence of his Theorem 1 in
[1] when one chooses the payoff matrix as

$$\begin{pmatrix} (0,1) & (1,0) \\ (0,0) & (1,1) \end{pmatrix}.$$

For a quick almost sure argument see [1]. Blackwell also raised the question whether
his Theorem 1 of [1] applies to sequential prediction when there are more than two
categories. We shall study this question and finally answer it in the affirmative.

We construct a Blackwell type prediction procedure for $d > 2$ categories by
choosing the state space and the randomisation rules in a certain way. This
procedure then has properties similar to those of Blackwell's original one. It also has the feature
that the $d$-category procedure reduces to the $(d-1)$-category procedure if one
category is not observed.

The structure of this paper is as follows. In Section 2 we introduce the
appropriate state space and define the randomisation rule. In Section 3 we state the
convergence result and prove it. For that we shall apply a simplified version of
Blackwell's Theorem 1 of [1], which we also state in Section 3.

This paper is a continuation of [2], where the case d = 3 was discussed, and of 
the diploma thesis of R. Sandvoss [5]. 

We shall use the following notation: Latin letters for points, vectors, and indices,
Greek letters for scalars. We denote components of vectors or points by superindices
like $v = (v^{(0)}, \dots, v^{(d-1)}) \in \mathbb{R}^d$. $e_0 = (1, 0, \dots, 0), \dots, e_{d-1} = (0, \dots, 0, 1)$ denote the
$d$-dimensional unit points and $\mathbf{1}_d = (1, \dots, 1)$. The affine subspace of $\mathbb{R}^d$ generated
by the points $a_0, \dots, a_n \in \mathbb{R}^d$ is given by

$$A(\{a_0, \dots, a_n\}) = \Big\{ a \in \mathbb{R}^d \ \Big|\ a = \sum_{i=0}^{n} \lambda_i a_i,\ \sum_{i=0}^{n} \lambda_i = 1,\ \lambda_i \in \mathbb{R},\ a_i \in \mathbb{R}^d,\ i = 0, \dots, n \Big\}.$$



Blackwell Prediction for Categorical Data 






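The convex hull of $a_0, \dots, a_n \in \mathbb{R}^d$ is given by

$$\mathrm{conv}(\{a_0, \dots, a_n\}) = \Big\{ a \in \mathbb{R}^d \ \Big|\ a = \sum_{i=0}^{n} \lambda_i a_i,\ \sum_{i=0}^{n} \lambda_i = 1,\ \lambda_i \in [0,1],\ a_i \in \mathbb{R}^d,\ i = 0, \dots, n \Big\}.$$

The Euclidean scalar product on $\mathbb{R}^d$ is given by $\langle \cdot, \cdot \rangle$, the Euclidean distance by
$\mathrm{dist}(\cdot, \cdot)$.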

2 The Construction of the d-Dimensional Prediction Procedure

2.1 The Structure of the Prediction Prism 

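For $n \in \mathbb{N}$ and $x_1, x_2, \dots, x_n \in D$ let $Y_1, Y_2, \dots, Y_n \in D$ denote the corresponding
predictions. Let $\bar{x}_n = (\bar{x}_n^{(0)}, \dots, \bar{x}_n^{(d-1)})$ with $\bar{x}_n^{(i)} = \frac{1}{n}\sum_{k=1}^{n} 1_{\{x_k = i\}}$, $i \in D$, denote
the vector of the relative frequencies of the $n$ outcomes and $\bar{\gamma}_n = \frac{1}{n}\sum_{k=1}^{n} 1_{\{Y_k = x_k\}}$
the relative frequency of correct predictions.

Let

$$\Sigma_{d-1} = \Big\{ (q^{(0)}, \dots, q^{(d-1)}) \ \Big|\ q^{(i)} \ge 0,\ \sum_{i=0}^{d-1} q^{(i)} = 1 \Big\}$$

denote the unit simplex in $\mathbb{R}^d$ and

$$W_d = \Sigma_{d-1} \times [0,1] = \{ (q, \gamma) \mid q \in \Sigma_{d-1},\ 0 \le \gamma \le 1 \}.$$

Since $\sum_{i=0}^{d-1} \bar{x}_n^{(i)} = 1$, we have $\bar{x}_n \in \Sigma_{d-1}$ and $(\bar{x}_n, \bar{\gamma}_n) \in W_d$. Let
$\tilde{S}_d = \{(q, \gamma) \in W_d \mid \gamma \ge \max_i q^{(i)}\}$. We are interested in prediction procedures for which
$\mu_n := (\bar{x}_n, \bar{\gamma}_n)$ converges to $\tilde{S}_d$ for every sequence $x_1, x_2, \dots$ This means that the
Euclidean distance $\mathrm{dist}(\mu_n, \tilde{S}_d) \to 0$ as $n \to \infty$.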
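
As a small illustration of these definitions, the following sketch computes the frequency vector and the hit rate from a finite record of outcomes and predictions and tests membership in the target set. The function names are hypothetical and the data are made up for the example.

```python
def prediction_state(xs, ys, d):
    """Return (x_bar_n, gamma_bar_n) for outcomes xs and predictions ys in {0..d-1}."""
    n = len(xs)
    counts = [0] * d
    for x in xs:
        counts[x] += 1
    x_bar = [c / n for c in counts]  # relative frequencies of the outcomes
    gamma_bar = sum(1 for x, y in zip(xs, ys) if x == y) / n  # share of hits
    return x_bar, gamma_bar


def in_target_set(x_bar, gamma_bar):
    """True if gamma_bar >= max_i x_bar[i], i.e. the state lies in the target set."""
    return gamma_bar >= max(x_bar)


x_bar, gamma_bar = prediction_state([0, 2, 2, 1], [0, 1, 2, 1], d=3)
print(x_bar, gamma_bar, in_target_set(x_bar, gamma_bar))
```

Here three of the four predictions are correct, so the hit rate 0.75 exceeds the largest outcome frequency 0.5 and the state lies in the target set.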

Unfortunately Blackwell's Theorem 1 of [1] cannot be applied directly. The
reader may take a look at Theorem 3.3 below which is a simplified version of
Blackwell's result. The condition (C) there does not hold in general for $W_d$ and
S d . (To see this, let d = 3, s = (|, §, |, |), //„ = (\, \, §, 0). Then p(/i n ) = and 
s — /i n is not perpendicular to TZ(p(fi n )).) 

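The difficulties vanish when one modifies the state space in the right way. Let
$V_d = \{q + \gamma \mathbf{1}_d \mid (q, \gamma) \in W_d\}$ with $\mathbf{1}_d = (1, \dots, 1)$. Then $v_n := \bar{x}_n + \bar{\gamma}_n \mathbf{1}_d \in V_d$
for all $n$. The convergence of $\mu_n$ to $\tilde{S}_d$ corresponds to that of $v_n$ to $S_d$ where
$S_d = \{q + \gamma \mathbf{1}_d \in V_d \mid \gamma \ge \max_i q^{(i)}\}$. This follows from the fact that $\Psi : W_d \to V_d$
with $\Psi((q, \gamma)) = q + \gamma \mathbf{1}_d$ is a bijection of $W_d$ onto $V_d$ which changes distances by at most the factor $\sqrt{d}$. We note that for
$z = (q, \gamma),\ z' = (q', \gamma') \in W_d$ it holds that

$$\mathrm{dist}(\Psi(z), \Psi(z'))^2 = \sum_{i=0}^{d-1} \big(q^{(i)} - q'^{(i)}\big)^2 + d \cdot (\gamma - \gamma')^2.$$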
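
The distance identity for the map $(q, \gamma) \mapsto q + \gamma \mathbf{1}_d$ can be checked numerically. The sketch below uses hypothetical helper names and two arbitrarily chosen points of the prism for $d = 3$; the cross term vanishes because both frequency vectors sum to one.

```python
def psi(q, gamma):
    """Map (q, gamma) to q + gamma * 1_d."""
    return [qi + gamma for qi in q]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


q1, g1 = [0.2, 0.3, 0.5], 0.4
q2, g2 = [0.6, 0.1, 0.3], 0.9
d = len(q1)

lhs = sq_dist(psi(q1, g1), psi(q2, g2))
rhs = sq_dist(q1, q2) + d * (g1 - g2) ** 2
print(lhs, rhs)  # both are 0.99 up to rounding
```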
To construct the appropriate randomisation regions let us "cut" the prism $V_d$ by
certain hyperplanes. (This corresponds to splitting the unit square by the diagonals
in the case of two categories.)

Let $e_0 = (1, 0, 0, \dots, 0), \dots, e_{d-1} = (0, 0, \dots, 0, 1)$ denote the $d$-dimensional unit
points. Let $E_l = A(\{e_0, \dots, e_{l-1}, e_l + \mathbf{1}_d, e_{l+1}, \dots, e_{d-1}\})$, $l = 0, \dots, d-1$, denote
the hyperplanes which contain one vertex $e_l + \mathbf{1}_d$ of the "upper side" of the prism
and $(d-1)$ vertices $e_i$, $i \ne l$, of $\Sigma_{d-1}$. The $d$ hyperplanes $E_l$ cut the prism $V_d$ in $2^d$
pieces, and all contain the point $s = (\frac{2}{d}, \frac{2}{d}, \dots, \frac{2}{d})$. In this point $s$ the planes $E_l$ are
all perpendicular to each other.

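This can easily be seen since their corresponding normal vectors $n_l$ are positive
multiples of $-e_l + \frac{2}{d}\mathbf{1}_d$, which are pairwise orthogonal. This leads to the following characterization of lying "above":

$$v \text{ lies above } E_l \iff \langle v - n_l, n_l \rangle > 0.$$

In the same way one defines lying below and in $E_l$.
Now we can describe $S_d$ in two different ways:

$$S_d = \{q + \gamma \mathbf{1}_d \in V_d \mid \langle q + \gamma \mathbf{1}_d - n_l, n_l \rangle \ge 0 \text{ for } l = 0, \dots, d-1\}
= \{q + \gamma \mathbf{1}_d \in V_d \mid \gamma \ge \max(q^{(0)}, \dots, q^{(d-1)})\}.$$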
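
The geometry of the cutting hyperplanes can be verified numerically. The sketch below is an illustration under an assumption: it fixes the normal direction as $-e_l + \frac{2}{d}\mathbf{1}_d$ (one convenient scaling, chosen here) and measures position relative to $E_l$ by the signed quantity $\langle v, n_l\rangle - \frac{2}{d}$, which for $v = q + \gamma\mathbf{1}_d$ equals $\gamma - q^{(l)}$. The helper names are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def normal(l, d):
    """Normal direction of E_l: -e_l + (2/d) 1_d (scaling is an assumption)."""
    return [2.0 / d - (1.0 if k == l else 0.0) for k in range(d)]


def height(v, l):
    """Signed position of v relative to E_l; positive means above, zero means on E_l."""
    d = len(v)
    return dot(v, normal(l, d)) - 2.0 / d


def in_S(q, gamma, tol=1e-12):
    """Membership of q + gamma * 1_d in S_d via the hyperplane description."""
    v = [qi + gamma for qi in q]
    return all(height(v, l) >= -tol for l in range(len(q)))


d = 4
s = [2.0 / d] * d
print([height(s, l) for l in range(d)])           # s lies on every E_l
print(dot(normal(0, d), normal(1, d)))            # normals are orthogonal
print(in_S([0.1, 0.2, 0.3, 0.4], 0.35), in_S([0.1, 0.2, 0.3, 0.4], 0.4))
```

The last line illustrates that the hyperplane description and the condition $\gamma \ge \max_l q^{(l)}$ agree on these two sample points.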
2.2 The Randomisation Rule

For $v_n = \bar{x}_n + \bar{\gamma}_n \mathbf{1}_d$ we will define a $d$-dimensional random vector $p(v_n) \in \Sigma_{d-1}$. It
plays the same role as $w_n$ does in the 0-1 case. With it we define $Y_{n+1}$:

$$P(\{Y_{n+1} = k\}) = p^{(k)}(v_n) \quad \text{for } k \in D.$$



Definition 2.1 Let $v_n \in V_d$, $n \in \mathbb{N}$, and let $(i_0, \dots, i_{d-1})$ be a permutation of
$(0, \dots, d-1)$ such that it holds:

$$\langle v_n - n_l, n_l \rangle < 0 \quad \text{for } l = i_0, \dots, i_j$$
$$\text{and } \langle v_n - n_l, n_l \rangle \ge 0 \quad \text{for } l = i_{j+1}, \dots, i_{d-1}.$$

Case 1: Let $v_n \in V_d \setminus S_d$.

Let $A_1 = A(\{\frac{2}{d}\mathbf{1}_d, e_{i_{j+1}}, \dots, e_{i_{d-1}}, v_n\})$ be the affine space of $\mathbb{R}^d$ generated by the
points in the curly brackets. Let $A_2 = A(\{e_{i_0}, \dots, e_{i_j}\})$ denote the corresponding
affine space. The intersection $A_1 \cap A_2$ contains exactly one point of $\Sigma_{d-1}$, we call
it $p(v_n)$.

Case 2: Let $v_n \in \partial S_d$. Let $\nu = \#\{E_k \mid v_n \in E_k \text{ for } k = 0, \dots, d-1\}$.
Then

$$p^{(k)}(v_n) = \begin{cases} 1/\nu & \text{for } v_n \in E_k, \\ 0 & \text{for } v_n \notin E_k, \end{cases}$$

for $k = 0, 1, \dots, d-1$.

The prediction procedure just defined is called "Generalized Blackwell algorithm".
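
For computations it is convenient to have the intersection point of Case 1 in closed form. The sketch below rests on a derivation made here, not stated in the text: writing $v = q + \gamma\mathbf{1}_d$ and reading "$v$ lies below $E_l$" as $\gamma < q^{(l)}$, solving the affine system for $A_1 \cap A_2$ gives $p^{(l)}(v) = (q^{(l)} - \gamma)^+ / \sum_k (q^{(k)} - \gamma)^+$. The function names are hypothetical, and `affine_weights` exists only to confirm that this point really is an affine combination of the generators of $A_1$.

```python
def randomisation(q, gamma, tol=1e-12):
    """Prediction probabilities p(v) for v = q + gamma * 1_d."""
    excess = [max(qi - gamma, 0.0) for qi in q]  # positive iff v lies below E_l
    total = sum(excess)
    if total > tol:  # Case 1: v outside S_d
        return [e / total for e in excess]
    # Case 2 (boundary): uniform over the planes containing v, i.e. q_k == gamma.
    # In the interior of S_d the choice is arbitrary; the same rule is reused.
    top = max(q)
    hit = [abs(qi - top) <= tol for qi in q]
    nu = sum(hit)
    return [1.0 / nu if h else 0.0 for h in hit]


def affine_weights(q, gamma):
    """Case 1 only: weights (alpha, beta, c) such that
    p = alpha * (2/d) 1_d + beta * v + sum_l c[l] * e_l,
    with c[l] = 0 whenever v lies below E_l and alpha + beta + sum(c) = 1."""
    d = len(q)
    below = [qi > gamma for qi in q]
    beta = 1.0 / sum(qi - gamma for qi, b in zip(q, below) if b)
    alpha = -beta * d * gamma
    c = [0.0 if b else -(alpha * 2.0 / d + beta * (qi + gamma))
         for qi, b in zip(q, below)]
    return alpha, beta, c


print(randomisation([0.5, 0.3, 0.2], 0.25))  # mass only on categories 0 and 1
print(randomisation([0.6, 0.4], 0.2))        # two categories: P(Y = 1) = 1/3
```

For two categories with $q = (1 - \bar{x}_n, \bar{x}_n)$ the formula reduces to $(\bar{x}_n - \bar{\gamma}_n)/(1 - 2\bar{\gamma}_n)$ in the bottom triangle, which agrees with the point $w_n$ of the original Blackwell rule.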

Remarks 2.2 1) The case $v_n \in S_d \setminus \partial S_d$ does not occur by the construction of the
rule.

2) $A_2 = \emptyset$ cannot occur, since there exists at least one $k \in D$ with
$\langle v_n - n_k, n_k \rangle < 0$.

3) We note that $A_1 \cap A_2$ always contains just one point of $\Sigma_{d-1}$.

4) For $j = d-1$ one obtains $A_1 = A(\{\frac{2}{d}\mathbf{1}_d, v_n\})$, $A_2 = A(\{e_{i_0}, \dots, e_{i_{d-1}}\})$ and
$p(v_n)$ is the projection along the line defined by $\frac{2}{d}\mathbf{1}_d$ and $v_n$ "down" to $\Sigma_{d-1}$.

5) For $d = 3$ the following figure shows the randomisation in a "lower" side piece
of the prism. Here two planes lie above $\mu_n$ and one below.
[Figure 4. Label in figure: nearest point in $S$ for $v$.]



3 The Convergence Result 

3.1 Main Result 

Theorem 3.1 Let $d \ge 2$. Then for the generalized Blackwell algorithm, applied to
any infinite sequence $x_1, x_2, \dots$ with values in $D$, it holds that $\mathrm{dist}(v_n, S_d) \to 0$ with
probability one as $n \to \infty$.
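
The statement can be illustrated by a small simulation. The sketch below is not a proof and makes several assumptions: it uses the closed form $p^{(l)} \propto (q^{(l)} - \gamma)^+$ for the randomisation rule (derived by solving the affine system of Definition 2.1, not stated in the text), it takes $Y_1 = 0$, and instead of the Euclidean distance to $S_d$ it reports the simpler gap $\max_l (\bar{x}_n^{(l)} - \bar{\gamma}_n)^+$, which is zero exactly when $v_n \in S_d$. All names are hypothetical.

```python
import random


def rule(q, gamma, tol=1e-12):
    """Prediction probabilities for the state (q, gamma)."""
    excess = [max(qi - gamma, 0.0) for qi in q]
    total = sum(excess)
    if total > tol:
        return [e / total for e in excess]
    top = max(q)
    hit = [abs(qi - top) <= tol for qi in q]
    return [1.0 / sum(hit) if h else 0.0 for h in hit]


def run(xs, d, seed=0):
    """Predict the sequence xs; return the final gap max_l (q_l - gamma)^+."""
    rng = random.Random(seed)
    counts = [0] * d
    correct = 0
    y = 0  # first prediction
    for n, x in enumerate(xs, start=1):
        counts[x] += 1
        correct += int(y == x)
        q = [c / n for c in counts]
        gamma = correct / n
        y = rng.choices(range(d), weights=rule(q, gamma))[0]
    return max(max(q) - gamma, 0.0)


xs = [k % 3 for k in range(20000)]  # a cyclic sequence with three categories
print(run(xs, d=3))
```

On this cyclic sequence the most recently seen category is never the next outcome, so a deterministic "predict the leader" rule would fail; the randomisation keeps the hit rate close to the largest outcome frequency.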

Now we shall derive Theorem 3.1 by tracing it back to Blackwell's Theorem 1
of [1]. This we first state in a simplified version.

3.2 Blackwell's Minimax Theorem 

We consider a repeated game of two players with a payoff matrix $M = (m_{ij})$ with
$m_{ij} \in \mathbb{R}^d$ and $1 \le i \le r$ and $1 \le j \le s$. Player I chooses the row, player II the
column. Let

$$\mathcal{P} = \Big\{ p = (p_1, \dots, p_r) \ \Big|\ p_i \ge 0,\ \sum_{i=1}^{r} p_i = 1 \Big\}$$

denote the mixed actions of player I and

$$\mathcal{Q} = \Big\{ q = (q_1, \dots, q_s) \ \Big|\ q_j \ge 0,\ \sum_{j=1}^{s} q_j = 1 \Big\}$$

the mixed actions of player II. A strategy $f$ in a repeated game for player I is
a sequence $f = (f_k;\ k \ge 1)$ with $f_k \in \mathcal{P}$. A strategy $g$ for player II is defined
similarly. Two strategies define a sequence of payoffs $z_k$, $k = 1, 2, \dots$ In detail:
If in the $k$-th game $i$ and $j$ are chosen according to $f_k$ and $g_k$, the payment to
player I is $m_{ij} \in \mathbb{R}^d$. Blackwell discussed in [1] the question: Can player I control
$\bar{z}_n = \frac{1}{n}\sum_{k=1}^{n} z_k$ with a certain strategy such that $\bar{z}_n$ approaches a given set $S$
independently of what player II does?

Definition 3.2 A set $S \subset \mathbb{R}^d$ is approachable for player I if there exists a strategy
$f^*$ for which $\mathrm{dist}(\bar{z}_n, S) \to 0$ with probability one, for every strategy $g$ of player II.

Theorem 3.3 (Blackwell) For $p \in \mathcal{P}$ let

$$\mathcal{R}(p) = \mathrm{conv}\Big(\Big\{ \sum_{i=1}^{r} p_i m_{ij};\ j = 1, 2, \dots, s \Big\}\Big).$$

Let $S$ denote a closed convex subset of $\mathbb{R}^d$. For every $z \notin S$ let $y$ denote the closest
point in $S$ to $z$. We assume:

(C) For every $z \notin S$ there exists a $p(z) \in \mathcal{P}$ such that the hyperplane through $y$,
which is perpendicular to the line segment $\overline{zy}$, separates $z$ from $\mathcal{R}(p(z))$.

Then $S$ is approachable for player I.
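
Condition (C) can be made concrete for a finite payoff matrix. The following sketch, with hypothetical function names, computes the vertices $\sum_i p_i m_{ij}$ of $\mathcal{R}(p)$ and tests the separation property through the inequality $\langle w - y, z - y\rangle \le 0$ for every vertex $w$. It is applied to the $2 \times 2$ payoff matrix of the 0-1 case from the introduction, with $z = (\frac{1}{2}, 0)$ and its closest point $y = (\frac{1}{2}, \frac{1}{2})$ in $S$.

```python
def payoff_vertices(M, p):
    """Vertices sum_i p[i] * M[i][j] of R(p), one for each column j."""
    rows, cols, dim = len(M), len(M[0]), len(M[0][0])
    return [tuple(sum(p[i] * M[i][j][k] for i in range(rows)) for k in range(dim))
            for j in range(cols)]


def separates(z, y, vertices, tol=1e-12):
    """True if the hyperplane through y perpendicular to z - y has z on one
    side and all given vertices on the other (or on the hyperplane)."""
    normal = [zi - yi for zi, yi in zip(z, y)]
    return all(
        sum((wi - yi) * ni for wi, yi, ni in zip(w, y, normal)) <= tol
        for w in vertices
    )


# Payoff matrix of the two-category case: rows are predictions, columns outcomes.
M = [[(0, 1), (1, 0)],
     [(0, 0), (1, 1)]]

good = payoff_vertices(M, [0.5, 0.5])   # the Blackwell choice w_n = 1/2
bad = payoff_vertices(M, [1.0, 0.0])    # always predicting 0
print(good, separates((0.5, 0.0), (0.5, 0.5), good))
print(bad, separates((0.5, 0.0), (0.5, 0.5), bad))
```

Since a half-space is convex, checking the vertices is enough to decide whether the whole convex hull $\mathcal{R}(p)$ lies on the required side.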
3.3 Proof of the Main Result 

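To apply Theorem 3.3 to our case, we choose the vertices of $V_d$ as "payments":

$$m_{ij} = \begin{cases} e_i + \mathbf{1}_d & \text{if } i = j, \\ e_j & \text{if } i \ne j. \end{cases}$$

We choose $S$ as $S_d = \{q + \gamma \mathbf{1}_d \in V_d \mid \gamma \ge \max_i q^{(i)}\}$. Then

$$\mathcal{R}(p) = \mathrm{conv}\Big(\Big\{ \sum_{i \ne j} p^{(i)} e_j + p^{(j)} (e_j + \mathbf{1}_d) \ \Big|\ j = 0, \dots, d-1 \Big\}\Big)$$
$$= \mathrm{conv}\Big(\Big\{ \sum_{i=0}^{d-1} p^{(i)} e_j + p^{(j)} \mathbf{1}_d \ \Big|\ j = 0, \dots, d-1 \Big\}\Big)$$
$$= \mathrm{conv}\big(\{ e_j + p^{(j)} \mathbf{1}_d \mid j = 0, \dots, d-1 \}\big).$$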
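
The simplification of $\mathcal{R}(p)$ to the vertices $e_j + p^{(j)}\mathbf{1}_d$ is easy to confirm numerically. The sketch below builds the payoff matrix for a given $d$ and compares the mixed payoffs with the closed form; the helper names and the sample vector $p$ are chosen here for illustration.

```python
def prism_payoffs(d):
    """M[i][j] = e_j + 1_d if i == j, else e_j (prediction i, outcome j)."""
    def e(j):
        return [1.0 if k == j else 0.0 for k in range(d)]
    return [[[c + (1.0 if i == j else 0.0) for c in e(j)] for j in range(d)]
            for i in range(d)]


def mixed_payoff(M, p, j):
    """Expected payoff sum_i p[i] * M[i][j] when the outcome is j."""
    d = len(p)
    return [sum(p[i] * M[i][j][k] for i in range(d)) for k in range(d)]


d = 3
p = [0.5, 0.3, 0.2]
M = prism_payoffs(d)
for j in range(d):
    closed_form = [(1.0 if k == j else 0.0) + p[j] for k in range(d)]
    print(j, mixed_payoff(M, p, j), closed_form)
```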



It is left to show that condition (C) is fulfilled.

Let $v \in V_d \setminus S_d$. We denote by $v_{\mathrm{proj}}$ the closest point in $S_d$ to $v$. We will show:

Fact 1 $v_{\mathrm{proj}} \in \mathcal{R}(p(v))$.

Fact 2 $v - v_{\mathrm{proj}}$ is perpendicular to $A(\mathcal{R}(p))$. Here $A(\mathcal{R}(p))$ means the smallest affine
subspace which contains $\mathcal{R}(p)$.

Both facts together imply condition (C) and finally Theorem 3.1: by Fact 1 and Fact 2
the set $\mathcal{R}(p(v))$ lies in the hyperplane through $v_{\mathrm{proj}}$ which is perpendicular to $v - v_{\mathrm{proj}}$.

For the proofs we shall assume that the following situation holds: For $v \in V_d \setminus S_d$
it holds

$$\langle v - n_i, n_i \rangle < 0 \quad \text{for } i = 0, \dots, j$$
$$\text{and } \langle v - n_i, n_i \rangle \ge 0 \quad \text{for } i = j+1, \dots, d-1.$$



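Proof of Fact 1: $v$ lies below $E_i$ for $i = 0, 1, \dots, j$, but $v_{\mathrm{proj}} \in \partial S_d$. Thus $v_{\mathrm{proj}} \in E_0 \cap \dots \cap E_j$. Then

$$E_0 \cap \dots \cap E_j = A\big(\{e_{j+1}, \dots, e_{d-1}, \tfrac{2}{d}\mathbf{1}_d\}\big).$$

Thus

$$v_{\mathrm{proj}} \in A\big(\{e_{j+1}, \dots, e_{d-1}, \tfrac{2}{d}\mathbf{1}_d\}\big) \cap V_d
\subset A\big(\{e_i + p^{(i)}(v)\mathbf{1}_d \mid i = 0, \dots, d-1\}\big) \cap V_d = \mathcal{R}(p(v)).$$

The inclusion follows since $p^{(l)}(v) = 0$ for $j+1 \le l \le d-1$ and $\frac{2}{d}\mathbf{1}_d = \frac{1}{d}\sum_{i=0}^{d-1}\big(e_i + p^{(i)}\mathbf{1}_d\big)$. □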

Fact 2 will be proven by a sequence of lemmata. At first we generate a new
auxiliary point $\tilde{v}$ which lies in the same plane as $p(v)$.

Lemma 3.4 For $v \in V_d \setminus S_d$ let $A' = A(\{v, v_{\mathrm{proj}}\})$ and $A'' = A(\{e_{j+1}, \dots, e_{d-1}, p(v)\})$.
Then there exists exactly one point $\tilde{v} \in A' \cap A''$ and $\tilde{v} \notin S_d$.

Proof: Let $A_1 = A(\{e_{j+1}, \dots, e_{d-1}, \frac{2}{d}\mathbf{1}_d, v\})$ as in Definition 2.1. Then according to
Definition 2.1 $p(v) \in A_1$ and $v_{\mathrm{proj}} \in A_1$ by the proof of Fact 1. Then it follows that
$A_1 \supseteq A' \vee A''$. Here $A' \vee A''$ denotes the smallest affine space which contains $A'$ and $A''$.
It holds $A_1 = A' \vee A''$. Since $A'$ and $A''$ are not parallel it follows that $A' \cap A'' \ne \emptyset$
and by the dimension formula $\dim(A' \cap A'') = 0$. Hence $A' \cap A''$ contains exactly
one point. We call it $\tilde{v}$. If $\tilde{v} \in S_d$, then $\tilde{v} \in S_d \cap A''$. Then $S_d \cap A(\Sigma_{d-1}) \ne \emptyset$, which
is a contradiction to the definitions of $S_d$ and $\Sigma_{d-1}$. □

A direct consequence of Lemma 3.4 is

Fact 3: a) $v_{\mathrm{proj}} = (\tilde{v})_{\mathrm{proj}}$;

b) $v - v_{\mathrm{proj}} \perp A(\mathcal{R}(p)) \iff \tilde{v} - (\tilde{v})_{\mathrm{proj}} \perp A(\mathcal{R}(p))$.

We shall use Fact 3 to show Fact 2. At first we calculate $(\tilde{v})_{\mathrm{proj}}$ from $\tilde{v}$. For
simplification, we write $\tilde{v}_{\mathrm{proj}}$ instead of $(\tilde{v})_{\mathrm{proj}}$ from now on.



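Lemma 3.5

$$\tilde{v}_{\mathrm{proj}}^{(i)} = \begin{cases}
\dfrac{2}{d}\Big(1 - \sum\limits_{k=j+1}^{d-1} \lambda_k\Big) & \text{for } i = 0, \dots, j, \\[2ex]
\dfrac{2}{d}\Big(1 - \sum\limits_{\substack{k=j+1 \\ k \ne i}}^{d-1} \lambda_k\Big) + \Big(1 - \dfrac{2}{d}\Big)\lambda_i & \text{for } i = j+1, \dots, d-1,
\end{cases}$$

where $\tilde{v} = p + \lambda_{j+1}(e_{j+1} - p) + \dots + \lambda_{d-1}(e_{d-1} - p) \in A''$.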



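Proof: From the proofs of Fact 1 and 3 it follows that

$$\tilde{v}_{\mathrm{proj}} \in A\big(\{e_{j+1}, \dots, e_{d-1}, \tfrac{2}{d}\mathbf{1}_d\}\big) \cap S_d.$$

The smallest affine space which contains this set is given by

$$A = \Big\{ a \in \mathbb{R}^d \ \Big|\ a = \tfrac{2}{d}\mathbf{1}_d + \delta_{j+1}\big(e_{j+1} - \tfrac{2}{d}\mathbf{1}_d\big) + \dots + \delta_{d-1}\big(e_{d-1} - \tfrac{2}{d}\mathbf{1}_d\big) \Big\}.$$

To find $\tilde{v}_{\mathrm{proj}}$, the projection of $\tilde{v}$ on $S_d$, we minimize the distance of $\tilde{v}$ to $A$.

For $a \in A$

$$d(\tilde{v}, a)^2 = \sum_{l=0}^{j} \Big( \tilde{v}^{(l)} - \frac{2}{d}\Big(1 - \sum_{k=j+1}^{d-1} \delta_k\Big) \Big)^2
+ \sum_{l=j+1}^{d-1} \Big( \tilde{v}^{(l)} - \frac{2}{d}\Big(1 - \sum_{k=j+1}^{d-1} \delta_k\Big) - \delta_l \Big)^2. \qquad (3.1)$$

Calculating partial derivatives with respect to $\delta_i$, $i = j+1, \dots, d-1$, yields

$$\frac{\partial d(\tilde{v}, a)^2}{\partial \delta_i} = 2 \sum_{l=0}^{d-1} \big( \tilde{v}^{(l)} - a^{(l)} \big)\, \alpha_l,$$

where $\alpha_l = \frac{2}{d}$ for $l \ne i$, $\alpha_l = -(1 - \frac{2}{d})$ for $l = i$, and thus, using $\sum_{l=0}^{d-1} a^{(l)} = 2 - \sum_{k=j+1}^{d-1} \delta_k$,

$$\frac{\partial d(\tilde{v}, a)^2}{\partial \delta_i} = 2 \Big( \delta_i - \tilde{v}^{(i)} + \frac{2}{d}\Big( \sum_{l=0}^{d-1} \tilde{v}^{(l)} - 1 \Big) \Big).$$

Setting these derivatives to zero determines the $\delta_i$. The determinant of the Hessian is positive which shows that a minimum occurs.
According to the statement of Lemma 3.5 the components of $\tilde{v}$ have the following
representation

$$\tilde{v}^{(l)} = \begin{cases}
p^{(l)}\Big(1 - \sum\limits_{k=j+1}^{d-1} \lambda_k\Big) & \text{for } l = 0, \dots, j, \\[2ex]
\lambda_l & \text{for } l = j+1, \dots, d-1,
\end{cases} \qquad (3.2)$$

where one should note that $p^{(j+1)} = \dots = p^{(d-1)} = 0$.

Plugging this into the equation for $\delta_i$, $i = j+1, \dots, d-1$, and noting that $\sum_{l=0}^{j} p^{(l)} = 1$
leads to

$$\sum_{l=0}^{d-1} \tilde{v}^{(l)} = \sum_{l=0}^{j} p^{(l)}\Big(1 - \sum_{k=j+1}^{d-1} \lambda_k\Big) + \sum_{l=j+1}^{d-1} \lambda_l = 1$$

and finally to $\delta_i = \lambda_i$. Inserting $\delta_i = \lambda_i$ into the representation of $a \in A$ leads to the statement of the
Lemma. □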
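
The projection formula of Lemma 3.5 can be sanity-checked numerically. The sketch below uses hypothetical helper names and an arbitrarily chosen example with $d = 4$ and $j = 1$. It builds $\tilde{v}$ from $p$ and the $\lambda_k$, evaluates the claimed projection, and verifies that the residual is orthogonal to the directions $e_m - \frac{2}{d}\mathbf{1}_d$, $m > j$, that span the affine space $A$.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def tilde_v(p, lam, j):
    """Components of v-tilde; lam[l] is only used for l > j."""
    big = sum(lam[j + 1:])
    return [p[l] * (1.0 - big) if l <= j else lam[l] for l in range(len(p))]


def tilde_v_proj(lam, j):
    """Claimed projection of v-tilde onto A (with delta_i = lambda_i)."""
    d = len(lam)
    c = 2.0 / d * (1.0 - sum(lam[j + 1:]))
    return [c if i <= j else c + lam[i] for i in range(d)]


def direction(m, d):
    """Spanning direction e_m - (2/d) 1_d of the affine space A."""
    return [(1.0 if k == m else 0.0) - 2.0 / d for k in range(d)]


p = [0.7, 0.3, 0.0, 0.0]    # p^(l) = 0 for l > j
lam = [0.0, 0.0, 0.1, 0.2]  # lambda_2, lambda_3
j, d = 1, 4

v = tilde_v(p, lam, j)
proj = tilde_v_proj(lam, j)
residual = [a - b for a, b in zip(v, proj)]
print([dot(residual, direction(m, d)) for m in range(j + 1, d)])  # about 0
```

Orthogonality of the residual to all spanning directions is the first-order condition used in the proof, so moving away from `proj` inside $A$ can only increase the distance to $\tilde{v}$.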



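Lemma 3.6 It holds:

1)

$$(\tilde{v} - \tilde{v}_{\mathrm{proj}})^{(l)} = \begin{cases}
\Big(p^{(l)} - \dfrac{2}{d}\Big)\Big(1 - \sum\limits_{k=j+1}^{d-1} \lambda_k\Big) & \text{for } l = 0, \dots, j, \\[2ex]
-\dfrac{2}{d}\Big(1 - \sum\limits_{k=j+1}^{d-1} \lambda_k\Big) & \text{for } l = j+1, \dots, d-1.
\end{cases} \qquad (3.3)$$

2) The smallest affine subspace which contains $\mathcal{R}(p)$ can be expressed as $x + U$
where one can choose $x = \tilde{v}_{\mathrm{proj}}$ and

$$\begin{cases}
e_i + p^{(i)}\mathbf{1}_d - \tilde{v}_{\mathrm{proj}} & \text{for } i = 0, \dots, j, \\
e_i + p^{(i)}\mathbf{1}_d - \tilde{v}_{\mathrm{proj}} = e_i - \tilde{v}_{\mathrm{proj}} & \text{for } i = j+1, \dots, d-1
\end{cases}$$

as linear generating system of $U$.

Proof: Statement 1) is a direct consequence of Lemma 3.5 and (3.2). Statement 2)
follows from the fact that $\tilde{v}_{\mathrm{proj}} = v_{\mathrm{proj}} \in \mathcal{R}(p(v))$ and that $\mathcal{R}(p) = \mathrm{conv}(\{e_i + p^{(i)}\mathbf{1}_d;\ i = 0, \dots, d-1\})$ where $p^{(j+1)} = \dots = p^{(d-1)} = 0$. □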



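Lemma 3.7 It holds

$$\tilde{v} - \tilde{v}_{\mathrm{proj}} \perp e_i + p^{(i)}\mathbf{1}_d - \tilde{v}_{\mathrm{proj}} \quad \text{for } i = 0, \dots, j.$$

Proof: We abbreviate $\Lambda = \sum_{k=j+1}^{d-1} \lambda_k$. Lemma 3.5 implies

$$\big(e_i + p^{(i)}\mathbf{1}_d - \tilde{v}_{\mathrm{proj}}\big)^{(l)} = \begin{cases}
p^{(i)} - \dfrac{2}{d}(1 - \Lambda) & l = 0, \dots, j;\ l \ne i, \\[1.5ex]
1 + p^{(i)} - \dfrac{2}{d}(1 - \Lambda) & l = i, \\[1.5ex]
p^{(i)} - \dfrac{2}{d}\Big(1 - \sum\limits_{\substack{k=j+1 \\ k \ne l}}^{d-1} \lambda_k\Big) - \Big(1 - \dfrac{2}{d}\Big)\lambda_l & l = j+1, \dots, d-1.
\end{cases} \qquad (3.4)$$

From (3.3) and (3.4) it follows, using $\sum_{l=0}^{j} p^{(l)} = 1$,

$$\big\langle \tilde{v} - \tilde{v}_{\mathrm{proj}},\ e_i + p^{(i)}\mathbf{1}_d - \tilde{v}_{\mathrm{proj}} \big\rangle$$
$$= \sum_{l=0}^{j} \Big(p^{(l)} - \frac{2}{d}\Big)(1 - \Lambda)\Big(p^{(i)} - \frac{2}{d}(1 - \Lambda)\Big) + \Big(p^{(i)} - \frac{2}{d}\Big)(1 - \Lambda)
- \sum_{l=j+1}^{d-1} \frac{2}{d}(1 - \Lambda)\Big(p^{(i)} - \frac{2}{d}(1 - \Lambda) - \lambda_l\Big)$$
$$= (1 - \Lambda)\Big(p^{(i)} - \frac{2}{d}(1 - \Lambda)\Big)\Big(1 - \frac{2(j+1)}{d} - \frac{2(d-1-j)}{d}\Big) + \frac{2}{d}(1 - \Lambda)\Lambda + \Big(p^{(i)} - \frac{2}{d}\Big)(1 - \Lambda)$$
$$= -(1 - \Lambda)\Big(p^{(i)} - \frac{2}{d}(1 - \Lambda)\Big) + \frac{2}{d}(1 - \Lambda)\Lambda + \Big(p^{(i)} - \frac{2}{d}\Big)(1 - \Lambda)$$
$$= \frac{2}{d}(1 - \Lambda)\big((1 - \Lambda) + \Lambda - 1\big) = 0.$$

□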



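Lemma 3.8 It holds

$$\tilde{v} - \tilde{v}_{\mathrm{proj}} \perp e_i + p^{(i)}\mathbf{1}_d - \tilde{v}_{\mathrm{proj}} = e_i - \tilde{v}_{\mathrm{proj}} \quad \text{for } i = j+1, \dots, d-1.$$

Proof: We abbreviate $\Lambda = \sum_{k=j+1}^{d-1} \lambda_k$. By Lemma 3.5 one gets

$$\big(e_i - \tilde{v}_{\mathrm{proj}}\big)^{(l)} = \begin{cases}
-\dfrac{2}{d}(1 - \Lambda) & l = 0, \dots, j, \\[1.5ex]
-\dfrac{2}{d}\Big(1 - \sum\limits_{\substack{k=j+1 \\ k \ne l}}^{d-1} \lambda_k\Big) - \Big(1 - \dfrac{2}{d}\Big)\lambda_l & l = j+1, \dots, d-1;\ l \ne i, \\[1.5ex]
1 - \dfrac{2}{d}\Big(1 - \sum\limits_{\substack{k=j+1 \\ k \ne i}}^{d-1} \lambda_k\Big) - \Big(1 - \dfrac{2}{d}\Big)\lambda_i & l = i.
\end{cases}$$

From this and (3.3) it follows, using $\sum_{l=0}^{j} p^{(l)} = 1$,

$$\big\langle \tilde{v} - \tilde{v}_{\mathrm{proj}},\ e_i - \tilde{v}_{\mathrm{proj}} \big\rangle$$
$$= -\sum_{l=0}^{j} \Big(p^{(l)} - \frac{2}{d}\Big)(1 - \Lambda)\,\frac{2}{d}(1 - \Lambda)
+ \sum_{l=j+1}^{d-1} \frac{2}{d}(1 - \Lambda)\Big(\frac{2}{d}(1 - \Lambda) + \lambda_l\Big) - \frac{2}{d}(1 - \Lambda)$$
$$= \frac{2}{d}(1 - \Lambda)^2\Big(-1 + \frac{2(j+1)}{d} + \frac{2(d-1-j)}{d}\Big) + \frac{2}{d}(1 - \Lambda)\Lambda - \frac{2}{d}(1 - \Lambda)$$
$$= \frac{2}{d}(1 - \Lambda)\big((1 - \Lambda) + \Lambda - 1\big) = 0.$$

□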
Finally we can state the proof of Fact 2: By Lemma 3.6, 3.7, and 3.8 one has
$\tilde{v} - \tilde{v}_{\mathrm{proj}} \perp A(\mathcal{R}(p))$. By Fact 3 it follows that $v - v_{\mathrm{proj}} \perp A(\mathcal{R}(p))$. □
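
The orthogonality statements behind Fact 2 can also be checked numerically. The following sketch, with hypothetical helper names and an arbitrarily chosen example ($d = 4$, $j = 1$), forms the residual $\tilde{v} - \tilde{v}_{\mathrm{proj}}$ from the component formulas and computes its scalar product with every generator $e_i + p^{(i)}\mathbf{1}_d - \tilde{v}_{\mathrm{proj}}$, $i = 0, \dots, d-1$.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def tilde_v(p, lam, j):
    """Components of v-tilde; lam[l] is only used for l > j."""
    big = sum(lam[j + 1:])
    return [p[l] * (1.0 - big) if l <= j else lam[l] for l in range(len(p))]


def tilde_v_proj(lam, j):
    """Projection of v-tilde according to the component formula."""
    d = len(lam)
    c = 2.0 / d * (1.0 - sum(lam[j + 1:]))
    return [c if i <= j else c + lam[i] for i in range(d)]


def generators(p, lam, j):
    """Vectors e_i + p[i] * 1_d - proj for i = 0..d-1 (p[i] = 0 for i > j)."""
    d = len(p)
    proj = tilde_v_proj(lam, j)
    return [[(1.0 if k == i else 0.0) + p[i] - proj[k] for k in range(d)]
            for i in range(d)]


p = [0.7, 0.3, 0.0, 0.0]
lam = [0.0, 0.0, 0.1, 0.2]
j = 1

residual = [a - b for a, b in zip(tilde_v(p, lam, j), tilde_v_proj(lam, j))]
products = [dot(residual, g) for g in generators(p, lam, j)]
print(products)  # all entries vanish up to rounding
```

The first $j + 1$ products correspond to Lemma 3.7 and the remaining ones to Lemma 3.8.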

Acknowledgements. 